library(tidyverse)
library(WDI)
EDA is an iterative cycle that helps you understand what your data says. When you do EDA, you:
Generate questions about your data
Search for answers by visualising, transforming, and/or modeling your data
Use what you learn to refine your questions and/or generate new questions
EDA is an important part of any data analysis. You can use EDA to make discoveries about the world; or you can use EDA to ensure the quality of your data, asking questions about whether the data meets your standards or not. (Posit Primers: EDA)
GDP, PPP (constant 2017 international $): NY.GDP.MKTP.PP.KD
Population, total: SP.POP.TOTL
Calculate GDP per Capita
GDP, PPP (constant 2017 international $): PPP GDP is gross domestic product converted to international dollars using purchasing power parity rates. An international dollar has the same purchasing power over GDP as the U.S. dollar has in the United States. GDP is the sum of gross value added by all resident producers in the country plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2017 international dollars. ID: NY.GDP.MKTP.PP.KD
Population, total: Total population is based on the de facto definition of population, which counts all residents regardless of legal status or citizenship. The values shown are midyear estimates. ID: SP.POP.TOTL
df_gdppcap <- WDI(indicator = c(gdp = "NY.GDP.MKTP.PP.KD", pop = "SP.POP.TOTL", gdppcap = "NY.GDP.PCAP.PP.KD"), extra = TRUE)
write_csv(df_gdppcap, "data/gdppcap.csv")
df_gdppcap <- read_csv("data/gdppcap.csv")
Rows: 16758 Columns: 15
── Column specification ────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (7): country, iso2c, iso3c, region, capital, income, lending
dbl (6): year, gdp, pop, gdppcap, longitude, latitude
lgl (1): status
date (1): lastupdated
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
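The download step above writes to data/gdppcap.csv, which assumes that a data folder already exists and repeats the download on every run. The following is a minimal sketch of a download-once pattern, not part of the original workflow. The helper name cached_csv and the stand-in fake_download are hypothetical, and base R's write.csv() and read.csv() are used so that the sketch runs without extra packages; write_csv() and read_csv() could be swapped in.

```r
# Hypothetical helper (not from the original): fetch once, then reuse the local CSV.
cached_csv <- function(path, fetch) {
  if (!file.exists(path)) {
    # Create the target folder first; writing fails if the folder is missing.
    dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
    write.csv(fetch(), path, row.names = FALSE)
  }
  read.csv(path)
}

# Stand-in for the real download so that the sketch runs offline.
# Real usage would pass function() WDI(indicator = ..., extra = TRUE) instead.
fake_download <- function() {
  data.frame(country = c("A", "B"), year = c(2020, 2020), pop = c(10, 20))
}

path <- file.path(tempdir(), "data", "example.csv")
df_example <- cached_csv(path, fake_download)  # first call writes the file
df_again   <- cached_csv(path, fake_download)  # second call only reads it
```

Deleting the CSV file forces a fresh download the next time the helper is called.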
Inspect the data frame with View(df_gdppcap).
str(df_gdppcap)
spc_tbl_ [16,758 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ country : chr [1:16758] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
$ iso2c : chr [1:16758] "AF" "AF" "AF" "AF" ...
$ iso3c : chr [1:16758] "AFG" "AFG" "AFG" "AFG" ...
$ year : num [1:16758] 2014 2012 2009 2013 1971 ...
$ status : logi [1:16758] NA NA NA NA NA NA ...
$ lastupdated: Date[1:16758], format: "2023-09-19" "2023-09-19" "2023-09-19" ...
$ gdp : num [1:16758] 7.02e+10 6.47e+10 4.99e+10 6.83e+10 NA ...
$ pop : num [1:16758] 32716210 30466479 27385307 31541209 11015857 ...
$ gdppcap : num [1:16758] 2144 2123 1824 2165 NA ...
$ region : chr [1:16758] "South Asia" "South Asia" "South Asia" "South Asia" ...
$ capital : chr [1:16758] "Kabul" "Kabul" "Kabul" "Kabul" ...
$ longitude : num [1:16758] 69.2 69.2 69.2 69.2 69.2 ...
$ latitude : num [1:16758] 34.5 34.5 34.5 34.5 34.5 ...
$ income : chr [1:16758] "Low income" "Low income" "Low income" "Low income" ...
$ lending : chr [1:16758] "IDA" "IDA" "IDA" "IDA" ...
- attr(*, "spec")=
.. cols(
.. country = col_character(),
.. iso2c = col_character(),
.. iso3c = col_character(),
.. year = col_double(),
.. status = col_logical(),
.. lastupdated = col_date(format = ""),
.. gdp = col_double(),
.. pop = col_double(),
.. gdppcap = col_double(),
.. region = col_character(),
.. capital = col_character(),
.. longitude = col_double(),
.. latitude = col_double(),
.. income = col_character(),
.. lending = col_character()
.. )
- attr(*, "problems")=<externalptr>
df_gdppcap |> select(region, income, lending) |> lapply(unique)
$region
[1] "South Asia" "Aggregates" "Europe & Central Asia"
[4] "Middle East & North Africa" "East Asia & Pacific" "Sub-Saharan Africa"
[7] "Latin America & Caribbean" "North America" NA
$income
[1] "Low income" "Aggregates" "Upper middle income" "Lower middle income"
[5] "High income" NA "Not classified"
$lending
[1] "IDA" "Aggregates" "IBRD" "Not classified" "Blend"
[6] NA
COUNTRY <- "World"
df_gdppcap |> filter(country == COUNTRY) |>
ggplot(aes(year, gdppcap)) + geom_line()
COUNTRY <- "World"
df_gdppcap |> filter(country == COUNTRY) |>
ggplot(aes(year, pop)) + geom_line()
Write your observations and questions.
df_gdppcap2 <- df_gdppcap |> drop_na(pop) |>
mutate(PCAP = gdp/pop, .after = gdppcap)
df_gdppcap2
df_gdppcap2 |> drop_na(gdppcap, PCAP) |> mutate(near = near(gdppcap, PCAP)) |>
summarize(numberofdata = n(), sum(near))
df_gdppcap2 |> filter(!near(gdppcap, PCAP))
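The comparison above uses near() rather than ==. A likely reason is that a ratio recomputed in floating point rarely matches a published value bit for bit. The sketch below uses made-up numbers to show the difference; the toy tibble and its values are illustrative only.

```r
library(dplyr)

x <- 0.1 + 0.2
x == 0.3       # exact comparison fails: 0.1 and 0.2 have no exact binary form
near(x, 0.3)   # comparison within a small absolute tolerance succeeds

# Toy data: a published ratio next to the inputs needed to recompute it.
toy <- tibble(
  gdp     = c(0.3, 200, 900),
  pop     = c(1, 4, 3),
  gdppcap = c(0.1 + 0.2, 50, 310)  # the last value deliberately disagrees
)

check <- toy |>
  mutate(PCAP = gdp / pop, same = near(gdppcap, PCAP)) |>
  summarize(n = n(), n_same = sum(same))
check
```

Because the tolerance of near() is absolute, rows flagged by !near() may still agree to many significant digits; looking at the size of the difference helps decide whether a mismatch matters.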
Write your observations and questions.
Two useful questions.
What type of variation occurs within my variables?
What type of covariation occurs between my variables?
See Link.
arrange(desc(gdp)) sorts the rows in descending order of gdp;
arrange(gdp) sorts them in ascending order.
df_gdppcap |> filter(year == 2022, region != "Aggregates") |>
drop_na(gdp) |> arrange(desc(gdp))
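The list of unique values earlier shows that region contains NA as well as "Aggregates". A point worth knowing about the filter above: filter() keeps only rows where the condition is TRUE, so rows with a missing region are dropped together with the aggregates. A small sketch with made-up rows:

```r
library(dplyr)

toy <- tibble(
  country = c("Japan", "World", "Somewhere"),
  region  = c("East Asia & Pacific", "Aggregates", NA)
)

# NA != "Aggregates" evaluates to NA, and filter() drops rows that are not TRUE.
kept <- toy |> filter(region != "Aggregates")
kept$country

# To keep rows with a missing region as well, say so explicitly.
kept_with_na <- toy |> filter(is.na(region) | region != "Aggregates")
kept_with_na$country
```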
Find the 10 countries with the highest GDP per capita.
Find the 10 countries with the lowest GDP per capita.
Find the 10 countries with the largest population.
Find the 10 countries with the smallest population.
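One possible approach to rankings like these, sketched on made-up data rather than the WDI data: either sort with arrange() and keep the first rows, or use slice_max() and slice_min() from dplyr. The toy tibble is an assumption for illustration, and n = 2 stands in for 10.

```r
library(dplyr)
library(tidyr)

toy <- tibble(
  country = c("A", "B", "C", "D", "E"),
  gdppcap = c(52000, 1300, NA, 18000, 950)
)

# Two highest values, after dropping rows with a missing value.
top2 <- toy |> drop_na(gdppcap) |> slice_max(gdppcap, n = 2)

# Two lowest values: the same idea with slice_min().
bottom2 <- toy |> drop_na(gdppcap) |> slice_min(gdppcap, n = 2)

# Equivalent with arrange(): sort in descending order, keep the first rows.
top2_alt <- toy |> drop_na(gdppcap) |> arrange(desc(gdppcap)) |> head(2)
```

By default slice_max() and slice_min() keep ties, so they can return more than n rows when values are equal.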
What type of covariation occurs between my variables?
df_gdppcap2 |> filter(year == 2022, region !="Aggregates") |>
drop_na(gdp, pop) |>
ggplot(aes(pop, gdp)) + geom_point()
df_gdppcap2 |> filter(year == 2022, region !="Aggregates") |>
drop_na(gdp, pop) |>
ggplot(aes(pop, gdp)) + geom_point() +
scale_x_log10() + scale_y_log10()
df_gdppcap2 |> filter(year == 2022, region !="Aggregates") |>
drop_na(gdp, pop) |>
ggplot(aes(pop, gdp)) + geom_point() +
geom_smooth(method = "lm", se = FALSE) +
scale_x_log10() + scale_y_log10()
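In ggplot2, scale transformations are applied before a statistic is computed, so the line drawn by geom_smooth(method = "lm") on log10 axes is a straight-line fit to the log10 values. The sketch below checks this on made-up data that follow an exact power law; the toy data frame is an assumption for illustration.

```r
library(ggplot2)

# Toy data following an exact power law, y = 2 * x^1.5 (made up).
toy <- data.frame(x = c(1, 10, 100, 1000, 10000))
toy$y <- 2 * toy$x^1.5

# Straight-line fit on the log10 values: the slope is the exponent and the
# intercept is log10 of the multiplier.
fit <- lm(log10(y) ~ log10(x), data = toy)
coef(fit)

# The same line as drawn by geom_smooth() on log10 axes.
p <- ggplot(toy, aes(x, y)) + geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  scale_x_log10() + scale_y_log10()
smooth <- layer_data(p, 2)  # fitted points of the second layer, in log10 units
slope_plot <- diff(range(smooth$y)) / diff(range(smooth$x))
```

Both slopes equal the exponent 1.5, so a straight line on a log-log plot corresponds to a power-law relationship between the two variables.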
df_gdppcap2 |> filter(year == 2020, region !="Aggregates") |> drop_na(gdp, pop) |>
ggplot(aes(pop, gdp, color = region)) + geom_point() +
scale_x_log10() + scale_y_log10()
df_gdppcap2 |> filter(year == 2020, region !="Aggregates") |>
drop_na(gdp, pop) |>
ggplot(aes(pop, gdp, color = region, shape = income)) + geom_point() +
scale_x_log10() + scale_y_log10()
df_gdppcap2 |> filter(year == 2020, region !="Aggregates") |>
drop_na(gdp, gdppcap, pop) |>
ggplot(aes(gdppcap, gdp, color = region, size = pop)) + geom_point() +
scale_x_log10() + scale_y_log10()
install.packages("plotly") # run once; skip if plotly is already installed
library(plotly)
test <- df_gdppcap2 |> filter(year == 2020, region !="Aggregates") |> drop_na(gdp, pop) |>
ggplot(aes(pop, gdp, color = country, shape = region)) + geom_point() +
scale_x_log10() + scale_y_log10() + theme(legend.position = "none")
test |> ggplotly()
Warning: The shape palette can deal with a maximum of 6 discrete values because more than 6
becomes difficult to discriminate; you have 7. Consider specifying shapes manually if
you must have them.
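The warning arises because there are seven regions but the default shape palette has six values. If shapes are needed, one option is to assign them manually with scale_shape_manual(). The sketch below uses one made-up point per region; the particular shape codes are an arbitrary choice from those listed in ?points.

```r
library(ggplot2)

regions <- c("East Asia & Pacific", "Europe & Central Asia",
             "Latin America & Caribbean", "Middle East & North Africa",
             "North America", "South Asia", "Sub-Saharan Africa")

# Toy data: one made-up point per region.
toy <- data.frame(region = regions, pop = 10^(4:10), gdp = 10^(8:14))

# One shape code per region; any seven distinct codes would do.
region_shapes <- setNames(c(0, 1, 2, 5, 6, 8, 4), regions)

p <- ggplot(toy, aes(pop, gdp, shape = region)) + geom_point() +
  scale_x_log10() + scale_y_log10() +
  scale_shape_manual(values = region_shapes)
p
```

Seven shapes are still hard to tell apart, which is the point of the warning, so colour is often the clearer choice for region.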
df_gdppcap |> filter(year == 2022, region != "Aggregates") |> drop_na(gdp) |>
ggplot(aes(gdp)) + geom_histogram()
df_gdppcap |> filter(year == 2022, region != "Aggregates") |> drop_na(gdppcap) |>
ggplot(aes(gdppcap)) + geom_histogram()
df_gdppcap |> filter(year == 2022, region != "Aggregates") |> drop_na(gdp) |>
ggplot(aes(gdp)) + geom_histogram() + scale_x_log10()
Set the number of bins with geom_histogram(bins = 20), etc.
df_gdppcap |> filter(year == 2022, region != "Aggregates") |> drop_na(gdp) |>
ggplot(aes(gdp)) + geom_histogram(bins = 20) + scale_x_log10()
df_gdppcap |> filter(year == 2022, region != "Aggregates") |> drop_na(gdppcap) |>
ggplot(aes(gdppcap)) + geom_histogram(binwidth = 10000)
Use scale_x_log10() and adjust the number of bins.
df_gdppcap |> filter(year == 2022, region != "Aggregates") |> drop_na(gdppcap) |>
ggplot(aes(gdppcap)) + geom_histogram(bins = 10) + scale_x_log10()
df_gdppcap |> filter(year == 2022, region != "Aggregates") |> drop_na(pop) |>
ggplot(aes(pop)) + geom_histogram(bins = 20) + scale_x_log10()
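To see roughly what a histogram on a log10 axis counts, the sketch below bins made-up populations by order of magnitude by hand. This is an illustration only: the values are invented, and ggplot2 places its bin edges differently.

```r
library(dplyr)

# Made-up populations spanning several orders of magnitude.
toy <- tibble(pop = c(11000, 52000, 340000, 2.6e6, 5.4e6, 9.9e6, 3.3e7, 1.4e9))

# floor(log10(pop)) is the power of ten at or below each value, so counting
# it gives bins that are one decade wide.
decades <- toy |>
  mutate(decade = floor(log10(pop))) |>
  count(decade)
decades
```

With scale_x_log10(), bin widths are measured in log10 units, so a binwidth of 1 corresponds to one decade, while bins = 20 splits the log10 range into 20 intervals.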
df_gdppcap |> filter(year == 2022,region != "Aggregates") |> drop_na(pop) |>
# group_by(region) |>
ggplot(aes(pop, fill = region)) + geom_histogram(col = "black", linewidth = 0.2) + scale_x_log10()
df_gdppcap |> filter(year == 2022, region != "Aggregates") |> drop_na(pop) |>
# group_by(region) |>
ggplot(aes(pop, fill = income)) + geom_histogram(col = "black", linewidth = 0.2) + scale_x_log10()
df_gdppcap |> filter(income == "Not classified") |> distinct(country) |> pull()
[1] "Venezuela, RB"
df_gdppcap |> filter(year == 2022, region != "Aggregates", income != "Not classified") |> drop_na(pop) |>
# group_by(region) |>
ggplot(aes(pop, fill = income)) + geom_histogram(col = "black", linewidth = 0.2) + scale_x_log10()
df_gdppcap2 |> filter(year %in% c(1990,2000, 2010, 2020)) |> drop_na(gdppcap) |>
ggplot(aes(gdppcap, factor(year))) + geom_boxplot() + scale_x_log10()
df_gdppcap2 |> filter(year %in% c(1990,2000, 2010, 2020)) |> drop_na(gdppcap) |>
ggplot(aes(gdppcap, factor(year))) + geom_boxplot() + scale_x_log10() +
labs(title = "Distribution of the GDP per Capita of Countries", subtitle = "Year 1990, 2000, 2010, 2020",
y = "Year", x = "GDP per capita in log10 scale")
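A boxplot summarises each group by its median and quartiles. The sketch below computes those numbers directly for made-up data, which can help when reading the boxes; the toy values are an assumption for illustration.

```r
library(dplyr)

toy <- tibble(
  year = rep(c(1990, 2020), each = 5),
  gdppcap = c(500, 1000, 2000, 8000, 32000,
              1000, 2000, 4000, 16000, 64000)
)

# Quartiles and interquartile range per year.
box_stats <- toy |>
  group_by(year) |>
  summarize(
    q1     = quantile(gdppcap, 0.25, names = FALSE),
    median = median(gdppcap),
    q3     = quantile(gdppcap, 0.75, names = FALSE),
    iqr    = IQR(gdppcap)
  )
box_stats
```

With scale_x_log10() the statistics are computed on the log10 values, so quartiles that fall between two data points can differ slightly from the ones computed on the original scale.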
df_gdppcap2 |> filter(year == 2020) |> drop_na(gdppcap) |>
filter(income != "Aggregates") |>
ggplot(aes(gdppcap, income, fill = income)) + geom_boxplot() +
scale_x_log10() +
theme(legend.position = "none")
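# Income levels in display order; INCOME must exist before factor() uses it.
INCOME <- c("Low income", "Lower middle income", "Upper middle income", "High income")
df_gdppcap2 |> filter(year == 2020) |> drop_na(gdppcap) |>
filter(income != "Aggregates") |>
ggplot(aes(gdppcap, factor(income, levels = INCOME), fill = income)) + geom_boxplot() + scale_x_log10() +
labs(y = "") +
theme(legend.position = "none")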
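The factor(income, levels = ...) call above controls the order in which the income groups appear on the axis. A minimal base R sketch of how levels work; the vector name INCOME_ORDER is hypothetical and chosen so that it does not overwrite anything in the session.

```r
income <- c("High income", "Low income", "Upper middle income", "Lower middle income")

# Default: levels are the sorted unique values, not a meaningful order.
levels(factor(income))

# Explicit levels set the order; values not listed become NA.
INCOME_ORDER <- c("Low income", "Lower middle income",
                  "Upper middle income", "High income")
f <- factor(c(income, "Not classified"), levels = INCOME_ORDER)
levels(f)
is.na(f)
```

The last point matters for this data: a value such as "Not classified" that is missing from the levels turns into NA and shows up as a separate NA category in the plot.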
df_gdppcap2 |> filter(year == 2020) |> drop_na(gdp) |>
filter(income != "Aggregates") |>
ggplot(aes(gdp, region, fill = region)) + geom_boxplot() + scale_x_log10() +
theme(legend.position = "none")
df_gdppcap2 |> filter(year == 2020) |> drop_na(gdp) |>
filter(income != "Aggregates") |>
filter(region == "Sub-Saharan Africa") |>
# "Middle East & North Africa", "Sub-Saharan Africa"
arrange(gdp)
CO2 emissions (metric tons per capita): EN.ATM.CO2E.PC
GDP per capita, PPP (constant 2017 international $): NY.GDP.PCAP.PP.KD
CO2 emissions (metric tons per capita): Carbon dioxide emissions are those stemming from the burning of fossil fuels and the manufacture of cement. They include carbon dioxide produced during consumption of solid, liquid, and gas fuels and gas flaring. ID: EN.ATM.CO2E.PC
GDP per capita, PPP (constant 2017 international $): GDP per capita based on purchasing power parity (PPP). PPP GDP is gross domestic product converted to international dollars using purchasing power parity rates. An international dollar has the same purchasing power over GDP as the U.S. dollar has in the United States. GDP at purchaser’s prices is the sum of gross value added by all resident producers in the country plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2017 international dollars. ID: NY.GDP.PCAP.PP.KD
df_co2gdp <- WDI(indicator = c(co2pcap = "EN.ATM.CO2E.PC", gdppcap = "NY.GDP.PCAP.PP.KD"), extra = TRUE)
write_csv(df_co2gdp, "data/co2gdp.csv")
df_co2gdp <- read_csv("data/co2gdp.csv")
Rows: 16758 Columns: 14
── Column specification ────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (7): country, iso2c, iso3c, region, capital, income, lending
dbl (5): year, co2pcap, gdppcap, longitude, latitude
lgl (1): status
date (1): lastupdated
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
COUNTRY <- "World"
df_co2gdp |> filter(country == COUNTRY) |>
ggplot(aes(year, co2pcap)) + geom_line()
ISO2C <- c("JP", "CN", "ID", "GB", "US", "DE", "FR")
df_co2gdp |> filter(iso2c %in% ISO2C) |> drop_na(co2pcap) |>
ggplot(aes(year, co2pcap, linetype = iso2c)) + geom_line()
Change the iso2c codes to those you want to investigate. Use df_codes under Environment.
Change linetype to col.
ISO2C <- c("JP", "CN", "ID", "GB", "US", "DE", "FR")
df_co2gdp |> filter(iso2c %in% ISO2C) |> drop_na(co2pcap) |>
ggplot(aes(year, co2pcap, col = iso2c)) + geom_line()
df_co2gdp |> filter(year == 2020) |> drop_na(co2pcap) |>
ggplot(aes(gdppcap, co2pcap)) + geom_point()
df_co2gdp |> filter(year == 2020) |>
drop_na(gdppcap, co2pcap) |>
ggplot(aes(gdppcap, co2pcap)) + geom_point() +
scale_x_log10() + scale_y_log10()
df_co2gdp |> filter(year == 2020) |>
drop_na(gdppcap, co2pcap) |>
ggplot(aes(gdppcap, co2pcap)) + geom_point() +
geom_smooth(method = "lm", formula = 'y~x', se = FALSE) +
scale_x_log10() + scale_y_log10()
df_co2gdp |> filter(year == 2020) |> drop_na(gdppcap, co2pcap) |>
lm(log10(co2pcap)~log10(gdppcap), data = _) |> summary()
Call:
lm(formula = log10(co2pcap) ~ log10(gdppcap), data = drop_na(filter(df_co2gdp,
year == 2020), gdppcap, co2pcap))
Residuals:
Min 1Q Median 3Q Max
-0.60778 -0.15660 -0.00651 0.16129 0.59437
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.31545 0.13386 -32.24 <2e-16 ***
log10(gdppcap) 1.13831 0.03288 34.62 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2362 on 228 degrees of freedom
Multiple R-squared: 0.8402, Adjusted R-squared: 0.8395
F-statistic: 1199 on 1 and 228 DF, p-value: < 2.2e-16
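One way to read the coefficients above: because both variables are on a log10 scale, the slope acts as an exponent. The sketch below plugs the printed estimates into the fitted equation; the helper predict_co2 is hypothetical, and the results describe an association across countries in 2020, not a causal effect.

```r
# Coefficients copied from the regression summary above.
intercept <- -4.31545
slope     <-  1.13831

# Hypothetical helper: fitted CO2 per capita (metric tons) at a given GDP per capita.
predict_co2 <- function(gdppcap) 10^(intercept + slope * log10(gdppcap))

predict_co2(10000)  # fitted value at 10,000 international dollars
10^slope            # multiplier on CO2 per capita when GDP per capita is 10 times larger
2^slope             # multiplier when GDP per capita doubles
```

Under this fit, a tenfold difference in GDP per capita goes with nearly fourteen times the CO2 emissions per capita, and a doubling goes with about 2.2 times.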
School enrollment, secondary (% gross): SE.SEC.ENRR
GDP per capita, PPP (constant 2017 international $): NY.GDP.PCAP.PP.KD
School enrollment, secondary (% gross): Gross enrollment ratio is the ratio of total enrollment, regardless of age, to the population of the age group that officially corresponds to the level of education shown. Secondary education completes the provision of basic education that began at the primary level, and aims at laying the foundations for lifelong learning and human development, by offering more subject- or skill-oriented instruction using more specialized teachers. ID: SE.SEC.ENRR
GDP per capita, PPP (constant 2017 international $): GDP per capita based on purchasing power parity (PPP). PPP GDP is gross domestic product converted to international dollars using purchasing power parity rates. An international dollar has the same purchasing power over GDP as the U.S. dollar has in the United States. GDP at purchaser’s prices is the sum of gross value added by all resident producers in the country plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2017 international dollars. ID: NY.GDP.PCAP.PP.KD
df_secgdp <- WDI(indicator = c(sec = "SE.SEC.ENRR", gdppcap = "NY.GDP.PCAP.PP.KD"), extra = TRUE)
write_csv(df_secgdp, "data/secgdp.csv")
df_secgdp <- read_csv("data/secgdp.csv")
Rows: 16758 Columns: 14
── Column specification ────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (7): country, iso2c, iso3c, region, capital, income, lending
dbl (5): year, sec, gdppcap, longitude, latitude
lgl (1): status
date (1): lastupdated
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
COUNTRY <- "World"
df_secgdp |> filter(country == COUNTRY) |>
ggplot(aes(year, sec)) + geom_line()
INCOME <- c("Low income", "Low & middle income", "Lower middle income", "Middle income", "Upper middle income", "High income")
df_secgdp |> filter(country %in% INCOME) |> drop_na(sec) |>
ggplot(aes(year, sec, linetype = factor(country, levels = INCOME))) + geom_line() +
labs(linetype = "Income Levels")
To compare countries instead, change the selection to the iso2c codes (ISO2C) of the
countries you want to investigate. Use df_codes under Environment to look up the codes.
df_secgdp |> filter(year == 2020) |> drop_na(sec) |>
ggplot(aes(gdppcap, sec)) + geom_point()
df_secgdp |> filter(year == 2020) |> drop_na(gdppcap, sec) |>
ggplot(aes(gdppcap, sec)) + geom_point() +
scale_x_log10()
df_secgdp |> filter(year == 2020) |> drop_na(gdppcap, sec) |>
ggplot(aes(gdppcap, sec)) + geom_point() +
geom_smooth(method = "lm", formula = 'y~x', se = FALSE) +
scale_x_log10()
df_secgdp |> filter(year == 2020) |> drop_na(gdppcap, sec) |>
lm(sec~log10(gdppcap), data = _) |> summary()
Call:
lm(formula = sec ~ log10(gdppcap), data = drop_na(filter(df_secgdp,
year == 2020), gdppcap, sec))
Residuals:
Min 1Q Median 3Q Max
-53.777 -10.846 -1.173 9.006 66.996
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -102.994 11.933 -8.631 6.38e-15 ***
log10(gdppcap) 46.088 2.841 16.222 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 15.64 on 157 degrees of freedom
Multiple R-squared: 0.6263, Adjusted R-squared: 0.624
F-statistic: 263.2 on 1 and 157 DF, p-value: < 2.2e-16
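Here only the explanatory variable is on a log10 scale, so the slope is read in percentage points of enrollment per tenfold change in GDP per capita. The sketch below plugs the printed estimates into the fitted equation; the helper predict_sec is hypothetical, and the numbers describe an association across countries in 2020.

```r
# Coefficients copied from the regression summary above.
intercept <- -102.994
slope     <-   46.088

# Hypothetical helper: fitted gross secondary enrollment (%) at a given GDP per capita.
predict_sec <- function(gdppcap) intercept + slope * log10(gdppcap)

predict_sec(10000)  # fitted value at 10,000 international dollars
slope * log10(2)    # percentage-point difference when GDP per capita doubles
slope               # percentage-point difference when GDP per capita is 10 times larger
```

Under this fit, doubling GDP per capita goes with roughly 14 percentage points higher gross enrollment. The straight line is not bounded, so fitted values at very low or very high incomes should be read with care.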
Posit Primers: Link.
Cheat Sheet: Link.
Shared Project: https://posit.cloud/content/5539763
R for Social Scientists: https://datacarpentry.org/r-socialsci/
Old Shared Project: https://rstudio.cloud/content/4858948
Data Analysis for Researchers AY2022: Link.
みんなのデータサイエンス - Data Science for All (in Japanese)
Do a similar investigation by selecting WDI codes.
Choose at least two WDI codes with their names
Name: Code:
Name: Code:
Replace dataframe_name, shortname1, and shortname2 in the following code.
The following code chunk is revised.
df_dataframe_name <- WDI(indicator = c(shortname1 = "", shortname2 = ""), extra = TRUE)
write_csv(df_dataframe_name, "data/dataframe_name.csv")
df_dataframe_name <- read_csv("data/dataframe_name.csv")
Inspect the data with head(), str(), and summary(), and
try printing df_dataframe_name.
Try as many visualizations as possible.
rank for each variable
line graph
scatterplot
scatterplot with a regression line
histogram
boxplot